
Applications in Computer Vision

6.5.1 Preliminaries

In a specific convolution layer, $w \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$, $a_{in} \in \mathbb{R}^{C_{in} \times W_{in} \times H_{in}}$, and $a_{out} \in \mathbb{R}^{C_{out} \times W_{out} \times H_{out}}$ represent its weights and feature maps, where $C_{in}$ and $C_{out}$ denote the numbers of input and output channels, $(H, W)$ are the height and width of the feature maps, and $K$ denotes the kernel size. Then we have the following.

$a_{out} = a_{in} \otimes w$,  (6.78)

where $\otimes$ is the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. The 1-bit model aims to quantize $w$ and $a_{in}$ into $b_w \in \{-1, +1\}^{C_{out} \times C_{in} \times K \times K}$ and $b_{a_{in}} \in \{-1, +1\}^{C_{in} \times H \times W}$ using efficient XNOR and bit-count operations to replace full-precision operations. Following [48], the forward process of the 1-bit CNN is

$a_{out} = \alpha \circ b_{a_{in}} \circledast b_w$,  (6.79)

where $\circledast$ denotes the XNOR and bit-count operations, and $\circ$ denotes channel-wise multiplication. $\alpha = [\alpha_1, \cdots, \alpha_{C_{out}}] \in \mathbb{R}^{C_{out}}_{+}$ is the vector consisting of channel-wise scale factors. $b = \operatorname{sign}(\cdot)$ denotes the binarized variable obtained via the sign function, which returns $+1$ if the input is greater than zero and $-1$ otherwise. The output then passes through several non-linear layers, e.g., the BN layer, the non-linear activation layer, and the max-pooling layer; we omit these for simplicity. The output $a_{out}$ is then binarized to $b_{a_{out}}$ via the sign function. The fundamental objective of BNNs is to compute $w$ such that it stays as close as possible to its value before binarization, thus minimizing the binarization effect. We then define the reconstruction error as

$L_R(w, \alpha) = \| w - \alpha \circ b_w \|$.  (6.80)
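As a concrete illustration, a well-known closed-form choice for the channel-wise scale factor (used in XNOR-Net-style binarization) is $\alpha_c = \mathbb{E}[|w_c|]$, the mean absolute value of each output channel's weights; this choice minimizes the squared reconstruction error per channel. A minimal NumPy sketch, with a randomly initialized weight tensor standing in for a real layer:

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, C_in, K = 8, 4, 3
w = rng.standard_normal((C_out, C_in, K, K))

# Binarize weights: +1 if the entry is greater than zero, -1 otherwise.
b_w = np.where(w > 0, 1.0, -1.0)

# Channel-wise scale factors: alpha_c = mean(|w_c|), the XNOR-Net
# closed-form minimizer of the per-channel squared reconstruction error.
alpha = np.abs(w).reshape(C_out, -1).mean(axis=1)

# Reconstruction error of Eq. (6.80), measured here as an L2 norm.
w_hat = alpha[:, None, None, None] * b_w
err_scaled = np.linalg.norm(w - w_hat)
err_unscaled = np.linalg.norm(w - b_w)
assert err_scaled < err_unscaled  # scaling reduces the binarization error
```

The assertion checks the point of introducing $\alpha$: rescaling the binary weights channel-wise brings them strictly closer to the full-precision weights than raw $\pm 1$ values.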

6.5.2 Select Proposals with Information Discrepancy

To eliminate the large magnitude difference between the real-valued teacher and the 1-bit student, we introduce a channel-wise transformation for the proposals$^{1}$ of the intermediate neck. We first apply a transformation $\phi(\cdot)$ on a proposal $\tilde{R}_n \in \mathbb{R}^{C \times W \times H}$ and have

$R_{n;c}(x, y) = \phi(\tilde{R}_{n;c}(x, y)) = \dfrac{\exp\big(\tilde{R}_{n;c}(x, y)/T\big)}{\sum_{(x,y)}^{(W,H)} \exp\big(\tilde{R}_{n;c}(x, y)/T\big)}$,  (6.81)

where $(x, y) \in (W, H)$ denotes a specific spatial location $(x, y)$ in the spatial range $(W, H)$, and $c \in \{1, \cdots, C\}$ is the channel index. $n \in \{1, \cdots, N\}$ is the proposal index, and $N$ denotes the number of proposals. $T$ denotes a hyper-parameter controlling the statistical attributes of the channel-wise alignment operation$^{2}$. After the transformation, the features in each channel of a proposal are projected into the same feature space [231] and follow a Gaussian distribution as

$p(R_{n;c}) \sim \mathcal{N}(\mu_{n;c}, \sigma^2_{n;c})$.  (6.82)
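The transformation in Eq. (6.81) is a per-channel softmax with temperature over the spatial locations of a proposal. A small NumPy sketch (the tensor shape and the random proposal are illustrative; $T = 4$ follows this section's setting):

```python
import numpy as np

rng = np.random.default_rng(1)
C, W, H = 16, 7, 7
T = 4.0  # temperature hyper-parameter, as set in this section

R_tilde = rng.standard_normal((C, W, H))  # a proposal feature patch

# Channel-wise softmax with temperature over all (x, y) locations.
z = R_tilde / T
z -= z.max(axis=(1, 2), keepdims=True)  # stabilizer; softmax is unchanged
e = np.exp(z)
R = e / e.sum(axis=(1, 2), keepdims=True)

# Each channel is now a normalized spatial distribution.
assert np.allclose(R.sum(axis=(1, 2)), 1.0)
```

Subtracting the per-channel maximum before exponentiating leaves the softmax output identical but avoids overflow for large activations, which is the standard way to implement Eq. (6.81) in practice.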

We further evaluate the information discrepancy between the teacher and the student proposals. As shown in Fig. 6.16, the teacher and the student have $N_T$ and $N_S$ proposals, respectively. Every proposal in one model generates a counterpart feature map patch at the same location in the other model. Thus, a total of $N_T + N_S$ proposal pairs are considered. To evaluate the information discrepancy, we introduce the Mahalanobis distance of each

$^{1}$In this section, the proposal denotes the neck/backbone feature map patch cropped by the region proposal of detectors.
$^{2}$In this section, we set $T = 4$.
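Since Eq. (6.82) summarizes each transformed proposal channel by Gaussian statistics, a Mahalanobis-style distance between a teacher/student proposal pair can be sketched as below. This is only an illustration under a diagonal-covariance assumption (the teacher's per-channel spatial variance standing in for $\sigma^2_{n;c}$), not the exact formulation of this section:

```python
import numpy as np

def mahalanobis_diag(r_s, r_t, eps=1e-8):
    """Diagonal-covariance Mahalanobis distance between a student
    proposal r_s and a teacher proposal r_t, both shaped (C, W, H).
    The teacher's per-channel spatial variance plays sigma^2_{n;c}."""
    var_t = r_t.var(axis=(1, 2), keepdims=True)
    d2 = ((r_s - r_t) ** 2 / (var_t + eps)).sum()
    return np.sqrt(d2)

rng = np.random.default_rng(2)
r_t = rng.standard_normal((16, 7, 7))              # teacher proposal
r_s = r_t + 0.1 * rng.standard_normal((16, 7, 7))  # well-aligned student
r_u = rng.standard_normal((16, 7, 7))              # unrelated proposal

# A well-aligned pair has a much smaller discrepancy than a random pair.
assert mahalanobis_diag(r_s, r_t) < mahalanobis_diag(r_u, r_t)
```

Normalizing the squared difference by the teacher's variance makes the discrepancy scale-aware: channels with naturally high variance are penalized less for the same absolute deviation.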